Neural Networks Attempt to Mimic Several Functions

Team Leader: Austin Derrow-Pinion

Team Members: Kice Sanders, Aaron Bartee

Executive Summary

In this study, we compare the power of neural networks to mimic functions against the functions themselves. The Universal Approximation Theorem states that any continuous function (on a compact domain) can be approximated arbitrarily well by a feed-forward network with a single hidden layer. This motivated us to pick a variety of functions and train neural networks to approximate as many of them as we could. For the functions that are not approximated well, we can examine the function and try to learn why the network failed to learn it.

If the neural networks are able to extrapolate and perform well, we will analyze their potential for use in business applications. If a neural network can evaluate a function faster than the function itself, it could save a lot of money for companies that rely on fast computing. This has the potential to introduce a new way of computing: replacing functions with neural networks when the functions become so complex that they take much longer to execute than a network does.

Introduction

The overall objective is to explore how complex a problem a neural network can learn to solve. We have programmed several functions in Python, ranging in complexity. A loop can feed a very large number of inputs to these functions and record the outputs, generating a large amount of data. Because we wrote the programs ourselves, we can generate as much data as we need to train the neural network.

With this data, we will use supervised learning by feeding the network with the input data and using back-propagation to update the weights in the network. The neural network will be programmed using skFlow from TensorFlow's library.

Data Preparation

Since all functions are written by us, we are able to generate all possible inputs in a given range for each function and record the output. This allows us to have as much data as needed to observe the performance of the neural network.

Below are examples of how we will generate data from the functions. Generating all possible examples in a range lets us train the network as thoroughly as possible. We then split the data, for example into 90% training and 10% testing, and evaluate the fully trained neural network on the testing data to see how well it generalized the function (a minimal sketch of this split appears after the first cell below).


In [1]:
import numpy as np
from trainingFunctions import *

# Inputs are 2x2 integer matrices, output is the determinant.
examples = 20
possibilities = np.zeros(examples)
batchInput = np.zeros(shape=(examples**4,4))
batchTarget = np.zeros(examples**4)

for i in range(examples):
    possibilities[i] = i

target_i = 0
for h in range(examples):  
    for i in range(examples):  
        for j in range(examples):    
            for k in range(examples):
                batchInput[target_i][0] = possibilities[h]
                batchInput[target_i][1] = possibilities[i]
                batchInput[target_i][2] = possibilities[j]
                batchInput[target_i][3] = possibilities[k]
                batchTarget[target_i] = determinant([[batchInput[target_i][0], batchInput[target_i][1]], [batchInput[target_i][2], batchInput[target_i][3]]])
                target_i += 1
print("sample input: " + str(batchInput[95]))
print("sample output: " + str(batchTarget[95]))


sample input: [  0.   0.   4.  15.]
sample output: 0.0
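
As a concrete illustration of the 90/10 split described in the Data Preparation section, the following minimal sketch holds out 10% of the determinant examples generated above for testing; the shuffle step and the variable names (trainInput, testInput, and so on) are our own choices for illustration.

# Shuffle the examples so the held-out 10% is not biased toward any particular entries,
# then split into 90% training data and 10% testing data.
order = np.random.permutation(len(batchInput))
split = int(0.9 * len(batchInput))

trainInput, testInput = batchInput[order[:split]], batchInput[order[split:]]
trainTarget, testTarget = batchTarget[order[:split]], batchTarget[order[split:]]

print("training examples:", len(trainInput))
print("testing examples:", len(testInput))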

In [2]:
# Fill the input array, x, with numbers 1 to TRAINING_EXAMPLES representing the indexes in the fib sequence
# Fill the output array, y, with the first TRAINING_EXAMPLES Fibonacci numbers
TRAINING_EXAMPLES = 10
x = np.zeros(TRAINING_EXAMPLES)
y = np.zeros(TRAINING_EXAMPLES)
for i in range(TRAINING_EXAMPLES):
    x[i] = i + 1
    y[i] = fib(i + 1)
print("Training data for fib function:")
print(x)
print(y)


Training data for fib function:
[  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.]
[  1.   1.   2.   3.   5.   8.  13.  21.  34.  55.]

In [3]:
# Fill the input array, x, with the binary representation of 0 to TRAINING_EXAMPLES
# Fill the output array y, with the output of evenParity() function with each 16-bit integer
TRAINING_EXAMPLES = 10
x = np.zeros((TRAINING_EXAMPLES, 16), dtype='int32')
y = np.zeros(TRAINING_EXAMPLES, dtype='int32')
for i in range(TRAINING_EXAMPLES):
    temp = bin(i)[2:].zfill(16)
    x[i] = [int(j) for j in temp]
    y[i] = evenParity(i)
print("Training data for evenParity function:")
print(x)
print(y)


Training data for evenParity function:
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1]]
[0 1 1 0 1 0 0 1 1 0]

Analysis

addThem(n, m): Calculates the sum of n and m.

Using the skflow library, I trained a TensorFlowDNNRegressor with two hidden units to provide the same output as the function.
Process:

  • Generated 1,000,000 random examples of adding numbers between 1 and 500.
    • Trained over 100,000 iterations.
    • Provided around 80% error.
    • We believed this was because the examples were too random, so the network never saw enough of the input space to generalize.
  • Generated all possible cases of adding numbers between 1 and 500.
    • Trained over 100,000 iterations.
    • The neural network was able to extrapolate past the training data, maintaining 100% accuracy up to around 700.
  • Generated all possible cases of adding numbers between 1 and 1,000.
    • Trained over 100,000 iterations.
    • Provided an error rate of 0.107, which means it generalized fairly well, but not perfectly.
  • Generated all possible cases of adding numbers between 1 and 10,000.
    • Trained it over 2,000,000 iterations.
    • Provided an error rate of 49.00%, which suggests a single hidden layer would be unable to generalize on a large scale.

Adding two numbers together proved to be a simple task for a neural network to learn. To teach it on a larger scale, it would be necessary to modify the architecture by adding more hidden layers.
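
For reference, here is a minimal sketch of this training setup. It assumes the standalone skflow package (the exact import path and default parameters depend on the installed TensorFlow version) and the addThem function from our trainingFunctions module.

import numpy as np
import skflow  # standalone skflow package; the import path may differ by version
from trainingFunctions import addThem

# All possible cases of adding numbers between 1 and 500.
X = np.array([[n, m] for n in range(1, 501) for m in range(1, 501)], dtype=np.float32)
y = np.array([addThem(n, m) for n, m in X], dtype=np.float32)

# Single hidden layer with two units, trained over 100,000 iterations.
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[2], steps=100000)
regressor.fit(X, y)

# Check extrapolation past the training range.
print(regressor.predict(np.array([[650.0, 20.0]], dtype=np.float32)))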

multiply(n, m): Calculates the product of n and m.

Using the skflow library, I trained a TensorFlowDNNRegressor with two hidden units to provide the same output as the function.
Process:

  • Generated all possible cases of n and m ranging from 1 to 10.
    • Trained over 1,000 Iterations.
      • Error was 96%, did not learn much at all.
    • Trained over 10,000 iterations.
      • Error was 95%, not much of a difference.
    • Trained over 50,000 iterations.
      • Error was still up at 92%.
    • Trained over 100,000 iterations.
      • Error was at 90%, showing a small decrease in error rate as training continued.
    • Trained over 500,000 iterations.
      • Error was at 91%, which tells me the neural network wasn't learning after all.
    • Trained over 1,000,000 iterations.
      • Error was at 92%. At this point, I knew for sure that the number of iterations was no longer affecting the error rate.

Given this error rate, this function required a different architecture.

We ran 200 different processes, each training a network to mimic the function multiply(n, m) on all possible inputs in the range 1 to 10. Each neural network regressor trained for 100,000 steps. Before doing this, I had a hypothesis that the best number of units in the hidden layer is the square of the largest number in the range. For example, since these inputs range from 1 to 10, my hypothesis was that the best number of hidden units is 10^2, or 100.

The graph below displays the results of this analysis.

  • At 90 hidden units, it reaches an accurate prediction with an MSE of 0.148.
  • At 100 hidden units, it predicts with an MSE of 0.303.
  • It predicts best with 200 hidden units, with an MSE of 0.039.

This tells me that my hypothesis was very wrong. Because of this, I wondered why having 200 hidden units performed so well. The total data was split into 90% training data and 10% testing. All of the errors were calculated on the test data.
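
A sketch of that sweep, under the same assumptions about the skflow API as above; it trains one regressor per hidden-layer size from 1 to 200 and scores each on the held-out 10% with scikit-learn's mean_squared_error.

import numpy as np
import skflow  # import path may differ by version
from sklearn.metrics import mean_squared_error
from trainingFunctions import multiply

# All possible cases of n and m ranging from 1 to 10.
X = np.array([[n, m] for n in range(1, 11) for m in range(1, 11)], dtype=np.float32)
y = np.array([multiply(n, m) for n, m in X], dtype=np.float32)

# 90% training data, 10% testing data.
order = np.random.permutation(len(X))
split = int(0.9 * len(X))
trainX, testX = X[order[:split]], X[order[split:]]
trainY, testY = y[order[:split]], y[order[split:]]

# One regressor per hidden-layer size, each trained for 100,000 steps.
results = {}
for hidden in range(1, 201):
    model = skflow.TensorFlowDNNRegressor(hidden_units=[hidden], steps=100000)
    model.fit(trainX, trainY)
    results[hidden] = mean_squared_error(testY, model.predict(testX))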

evenParity(n): Outputs a 0 if the number of 1 bits in the binary representation of n is even, else outputs a 1.

Using the skflow library, I trained a TensorFlowDNNClassifier with sixteen hidden units to provide the same output as the function.
Process:

  • Generated every 16-bit integer, 1 to 65,535.
    • Trained over 1,000 iterations.
      • Error rate was at 49.97%, which means the neural network is just guessing since there are only 2 possible outputs.
    • Trained over 10,000 iterations.
      • Error rate was at 50.01%, which means it is still guessing.
    • Trained over 50,000 iterations.
      • Error rate was at 43.69%. It shows some improvement, but nothing too significant.
    • Trained over 100,000 iterations.
      • Error rate was at 18.46%. This is great improvement! Somewhere between 50,000 and 100,000 iterations, the neural network really starts learning.
    • Trained over 500,000 iterations.
      • Error rate was at 3.15%. The neural network is now showing consistent learning. There is room for improvement still though.
    • Trained over 1,000,000 iterations.
      • Error rate was at 3.84%. This suggests that performance plateaus somewhere between 100,000 and 500,000 iterations; training for 1,000,000 iterations gave no further benefit.

We originally expected this to be a function that a neural network would be unable to learn. The network we trained used all possible 16-bit integers. This let us construct the architecture as a fully connected network with 16 input units and 16 output units. By trial and error, I found 16 to be the best number of units in the hidden layer.
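
A minimal sketch of that setup, again assuming the skflow classifier API (the n_classes argument and import path are assumptions about the library version); the bit vectors are built the same way as in the data-preparation cell earlier.

import numpy as np
import skflow  # import path may differ by version
from trainingFunctions import evenParity

# Every 16-bit integer as a vector of 16 bits, with its parity as the label.
N = 2 ** 16
X = np.array([[int(b) for b in bin(i)[2:].zfill(16)] for i in range(N)], dtype=np.float32)
y = np.array([evenParity(i) for i in range(N)], dtype=np.int32)

# 90% training data, 10% testing data.
order = np.random.permutation(N)
split = int(0.9 * N)
trainX, testX = X[order[:split]], X[order[split:]]
trainY, testY = y[order[:split]], y[order[split:]]

# Fully connected classifier with a single hidden layer of 16 units.
classifier = skflow.TensorFlowDNNClassifier(hidden_units=[16], n_classes=2, steps=500000)
classifier.fit(trainX, trainY)
print("test accuracy:", np.mean(classifier.predict(testX) == testY))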

We analyzed how quickly it was able to predict accurately on test data. We trained it on 90% of all the possible cases, then tested it on the remaining 10%. The results are shown in the graph below. The MSE leveled off after about 150,000 steps of training. It is a bit difficult to see the differences past that point, so I have provided another graph zoomed in. Even though the MSE fluctuates a bit, anywhere in this range the neural network is about as accurate as it can be with the given data; the fluctuations come from slight overfitting caused by continued training on the same data.

Overall, we were very impressed with the neural network's ability to learn this function.

adder(n): Adds 42 to n

Process:

  • Tried different methods of generating data to find which worked the best
    • All possible values 0-100
    • 1000 random values 0-100
    • All possible values 0-100, repeated 10 times, for a total of 1,000 data points
    • All possible values 0-100, repeated 100 times, for a total of 10,000 data points
  • Experimented with single layer hidden units 1-20
  • Experimented with two layer hidden units [1-20, 1-20]
  • Found how well each different neural net extrapolated to other values
  • Tried to scale data up to learn 1-1000

Results:

Data                                    MSE
100 data points: 1-100                  0.065861
1,000 data points: random(100)          0.028475
1,000 data points: 1-100, 10 times      0.007759
10,000 data points: 1-100, 100 times    409.116

The results seem to show that iterating through every possibility multiple times and training on that is the best method of gathering data. As with many things in this project, there is a fine line between having the right amount of data and having too much.

I knew that this function would be slower than Python's built-in add, because it has to do matrix multiplication; I wanted to know how much slower. When adding small numbers, Python's add was up to 2,000 times faster. As the numbers grew larger, however, Python's add was only a few hundred times as fast.
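
A rough sketch of how such a timing comparison can be made. Here adder_net is a hypothetical name for a trained regressor, and the measured ratio depends heavily on hardware and batch size.

import time
import numpy as np

def average_seconds(fn, repeats):
    """Average wall-clock time of fn() over the given number of repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

n = 57
x = np.array([[float(n)]], dtype=np.float32)

python_time = average_seconds(lambda: n + 42, repeats=100000)
# adder_net is a hypothetical trained regressor for adder(n)
net_time = average_seconds(lambda: adder_net.predict(x), repeats=100)

print("neural net is roughly %.0f times slower" % (net_time / python_time))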

I was interested in how far a neural net could extrapolate to numbers it had never seen before. Although this should have been an easy problem because the function is linear, it wasn't, because skflow doesn't let you choose the activation function for your regressor. Most of the single- and double-layer neural nets I trained failed around 200-300; however, one stood out and was able to correctly predict 1-1145 after training only on 1-100.
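
The extrapolation check itself is simple: walk upward from the end of the training range and report the first input where the rounded prediction no longer equals n + 42. Again, adder_net is a hypothetical name for a trained regressor.

import numpy as np

def extrapolation_limit(adder_net, start=101, stop=5000):
    """Return the first n >= start where the rounded prediction is wrong, or None."""
    for n in range(start, stop):
        prediction = adder_net.predict(np.array([[float(n)]], dtype=np.float32))
        if int(round(float(np.ravel(prediction)[0]))) != n + 42:
            return n
    return None

# adder_net is a hypothetical trained regressor for adder(n)
print("first failure at n =", extrapolation_limit(adder_net))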

The figure below shows how well the neural net is able to extrapolate after 100 steps and after 10,000 steps. After 10,000 steps, the neural net's line closely matches the real function; however, the net eventually starts adding slightly too little or too much, which stops it from extrapolating any further. It is worth noting that the regressor does capture the linear relationship and treats it as such.

determinant(m): Computes the determinant of matrix m

  • Started by generating every possible 2x2 matrix with the integers 1-20
  • Ran MSE tests on single and double hidden layers with sizes in increments of 100, from 100 to 3,000
  • Many had similar MSE and all performed poorly on accuracy tests; the lowest error percentage I could achieve was 45%
  • The neural network took only 8-30 times longer to compute a determinant than NumPy's determinant function

My main goal when starting to look at determinants was to truly test the extrapolation power of the neural network I was using. I wanted to train on 2x2 matrices and see if it could extrapolate to a 3x3. However, I couldn't achieve good enough results even on 2x2 matrices, and these poor results came under optimal conditions: I was able to train on every possible matrix with integers from 1 to 20, and I tested many different hidden layers. For a single layer, the graph below displays the MSE after 10,000 steps.

I decided to scrap the idea of even attempting a 3x3. Sticking with integers from 1 to 20 would have meant 20^9 possible matrices, far too much data to train on in a reasonable time, and resorting to random integers produced even worse results because there were too many cases the neural network would never see.

Overall, I would say that trying to train a neural network with skflow to compute a determinant was a failure in any practical sense, because the lowest error produced was 45%. Training longer was not helping the neural net extrapolate beyond the values it had already seen; rather, it was simply memorizing every example it saw. In that case, a lookup table would be more efficient and more accurate.
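
One way to compute the kind of error percentage quoted above is to compare rounded predictions against the true determinants on the held-out matrices. In this sketch, det_net, testX, and testY are hypothetical names for the trained regressor and the 10% test split.

import numpy as np

def determinant_error_percentage(det_net, testX, testY):
    """Percentage of test matrices whose rounded prediction misses the true determinant."""
    predictions = np.round(np.ravel(det_net.predict(testX)))
    wrong = np.sum(predictions != testY)
    return 100.0 * wrong / len(testY)

# det_net, testX, testY are hypothetical names for the trained model and test split
print("error rate: %.1f%%" % determinant_error_percentage(det_net, testX, testY))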


In [5]:
import pandas as pd
from bokeh.charts import Scatter, show
from bokeh.plotting import figure
from bokeh.io import output_notebook
output_notebook()
# Load the neural net's sine predictions and plot them (blue) against the
# true sine curve (red); black vertical lines mark the training boundary at ±3π.
df = pd.read_csv('../MA490-MachineLearning-FinalProject/sineData5.csv')

fig = Scatter(df[20:80], x='Input', y='Prediction', color='blue')
f = figure()
f.line(df.Input.values[18:82], df.Prediction.values[18:82], color='blue')
x = np.linspace(-3*np.pi - np.pi/2, 3*np.pi + np.pi/2, 100)
y = np.sin(x)
f.line(x, y, color='red')
x = [-3*np.pi, -3*np.pi]
y = [-3, 3]
f.line(x, y, color='black')
x = [3*np.pi, 3*np.pi]
y = [-3, 3]
f.line(x, y, color='black')


BokehJS successfully loaded.
Out[5]:
<bokeh.models.renderers.GlyphRenderer at 0x95a7cf8>

Sine Graph


In [6]:
show(f)


Out[6]:
<bokeh.io._CommsHandle at 0x94e7278>

The graph above shows the actual sine wave in red and, in blue, the neural net's prediction of 100 points between roughly -5π and 5π. The black lines mark the training range; inside it the prediction is much more accurate, while outside it the prediction diverges toward infinity. This was the original prediction, using 9 hidden units and 10,000 numbers between -3π and 3π. We used a more accurate model later.

sine(x): Calculates the sine of x.

Using the skflow library, we trained a TensorFlowDNNRegressor with 9 hidden units.
Process:

  • Originally trained using random data by picking random numbers between 0 and 1000 and converting to radians

    • Trained over 10,000 iterations.
    • Used 2 hidden units.
    • High error, mean squared error of 0.609.
  • Next used 0 to 720 degrees and fed all the values to the net.

    • Trained over 10,000 iterations.
    • Used 2 hidden units.
    • Still had very high error, didn't seem to learn very well. Mean squared error of 0.499.
  • Generated 10,000 numbers between -π and π and fed the neural network the first 9 terms of the sine Taylor expansion for each value (a minimal sketch of this setup appears after this list).

    • Trained over 50,000 iterations.
    • Used 9 hidden units because the input had 9 terms.
    • Was much more accurate than the previous attempts; predictions between -π and π are almost spot on, with a mean squared error of 0.00002761.
    • As the range increases, the prediction becomes less accurate and heads off to infinity; the net cannot generalize the sine curve. The mean squared error between -2π and 2π was an astounding 28556.8.
  • Generated 50,000 numbers between -3π and 3π and tried several different variations of hidden units.

    • Trained over 100,000 iterations
    • Using 2 layers of 6 hidden units:
      • Accurate between -3π and 3π with a mean squared error of 0.000468
      • Can't generalize the curve: the left side goes to negative infinity and the right to positive infinity. Mean squared error of 9139.7 between -4π and 4π.
    • Using a single layer of 729 hidden units:
      • Accurate between -3π and 3π with a mean squared error of 0.016726, not as accurate as the 2-layer neural net.
      • Still can't generalize, but has a better mean squared error of 858.7 between -4π and 4π. Interestingly, both sides go to positive infinity with this net; on the negative side the curve appears headed toward negative infinity but then turns back up toward positive infinity.
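
A minimal sketch of the Taylor-term setup from the third bullet above. The sine_taylor_terms helper is introduced here for illustration and is not part of trainingFunctions; the skflow import path is again an assumption about the library version.

import numpy as np
import skflow
from math import factorial

def sine_taylor_terms(x, n_terms=9):
    """First n_terms terms of the Taylor series of sin(x) about 0."""
    return [(-1) ** k * x ** (2 * k + 1) / factorial(2 * k + 1) for k in range(n_terms)]

# 10,000 points in [-pi, pi]; the inputs are the 9 Taylor terms, the target is sin(x).
xs = np.linspace(-np.pi, np.pi, 10000)
X = np.array([sine_taylor_terms(x) for x in xs], dtype=np.float32)
y = np.sin(xs).astype(np.float32)

# Nine hidden units, one per input term, trained over 50,000 iterations.
net = skflow.TensorFlowDNNRegressor(hidden_units=[9], steps=50000)
net.fit(X, y)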

In [7]:
df = pd.read_csv('../MA490-MachineLearning-FinalProject/sineData(pi).csv')
f = figure()
f.line(df.Input.values, df.Prediction.values, color='blue')
x = np.linspace(-np.pi, np.pi, 100)
y = np.sin(x)
f.line(x, y, color='red')
# Black vertical lines marking the training range (-π to π)
x = [-np.pi, -np.pi]
y = [-2, 2]
f.line(x, y, color='black')
x = [np.pi, np.pi]
y = [-2, 2]
f.line(x, y, color='black')

In the graph below, the red curve is the actual sine wave, while the blue curve is a prediction from the neural net described above, trained on numbers between -π and π. As you can see, it is very accurate.


In [8]:
show(f)


Out[8]:
<bokeh.io._CommsHandle at 0x95a7978>

However, the neural net used above could not extrapolate past -π and π; its predictions simply head off to infinity. The graph below, using 50,000 numbers between -3π and 3π and 729 hidden units, came the closest to extrapolating the sine wave, yet it still failed.


In [9]:
df = pd.read_csv('../MA490-MachineLearning-FinalProject/sineData8.csv')
f = figure()
f.line(df.Input.values, df.Prediction.values, color='blue')
x = np.linspace(-4*np.pi, 4*np.pi, 1000)
y = np.sin(x)
f.line(x,y,color='red')
x = [-3*np.pi,-3*np.pi]
y = [-2,2]
f.line(x,y,color='black')
x = [3*np.pi,3*np.pi]
y = [-2,2]
f.line(x,y,color='black')


Out[9]:
<bokeh.models.renderers.GlyphRenderer at 0x9631748>

The graph below, using 50,000 numbers between -3π and 3π and 729 hidden units, is a prediction of 100 points between -4π and 4π. The black lines mark the range it was trained on. It comes close to extrapolating the sine wave but still fails outside the training range.


In [10]:
show(f)


Out[10]:
<bokeh.io._CommsHandle at 0x95a7dd8>

Conclusion

In conclusion, we found that skflow is good for quickly and easily training a neural net on data, but it is unrealistic for real applications because it is too specialized and not very customizable. The next step in our neural network exploration would be to learn TensorFlow itself, so that we can customize networks to our needs and optimize them better.

We were able to train with reasonable accuracy on linear functions, continuous functions, and the sine function, which has clear repetition. Some networks were able to extrapolate past the training data, which tells us they generalized the function well; others we could not get to extrapolate much beyond the training range. We had a difficult time finding a function the networks could not approximate within the training range. The only one they were unable to learn was an approximation of the Fibonacci sequence, which is best learned by a recurrent neural network and is outside the scope of this class.

We define several different functions we can use to train a neural network in the ProjectReportSupplement.ipynb notebook.

The Universal Approximation Theorem is explained here: [http://neuralnetworksanddeeplearning.com/chap4.html]

Our code is open sourced on GitHub here: [https://github.com/derrowap/MA490-MachineLearning-FinalProject/]

